[1]:
import numpy as np
import pandas as pd
import transportation_tutorials as tt
from transportation_tutorials.choice_theory_figures import *
Note: figures labeled 'interactive' will not actually be interactive unless they are
created in a live Jupyter notebook. If you're viewing this as a page on the course
static website, you'll be able to see the default figures but not change the sliders.

Estimating Model Parameters

Let us suppose we have some observations of trips taken by Car or Bus. We will represent the utility only with travel time and travel cost. Here we have a couple of observations:

[2]:
data2 = pd.read_csv(tt.data("data2_ca.csv"))
data2
[2]:
obs mode time cost chosen
0 1 bus 20 10 True
1 1 car 10 20 False
2 2 car 12 16 True
3 2 bus 21 14 False

We can plot each observation separately, showing the chosen and unchosen alternatives. Since there are only two alternatives for each observation, the graph is quite simple.

[3]:
figure_two_observations(data2)
[3]:

In addition to the observations, these figures also include some notes about the implied utility of the two alternatives. Specifically, for each obs, the utility of the chosen alternative needs to be better (higher) than the utility of the non-chosen alternative.

To create a model for this, we need to find a set of parameters that will allow us to predict the choices: the utility of each choice should be better than that for each non-choice. If we map a utility function onto these figures, both “better” arrows should be pointing uphill.

Let’s visualize some different fields of utility functions.

[4]:
figure_four_fields({
    'beta time = -0.1, beta cost = -0.1': lambda t,c: t*-0.1+c*-0.1,
    'beta time = -0.1, beta cost = -0.2': lambda t,c: t*-0.1+c*-0.2,
    'beta time = -0.5, beta cost = -0.1': lambda t,c: t*-0.5+c*-0.1,
    'a couple of wavy lines': lambda t,c: -0.1*(np.sin(t)+t)+c*-0.2,
})
[4]:

The contour lines on these fields are “iso-utility” lines: along each line, the utility value is constant. For the linear fields, the slope of these lines is determined by the value of time (i.e. the ratio between the parameters on time and cost).
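To make the connection between the parameters and these contour lines concrete, here is a minimal standalone sketch (the beta values are illustrative, not estimated from the data):

```python
import numpy as np

beta_time, beta_cost = -0.1, -0.2  # illustrative values

# Along an iso-utility line, beta_time*dt + beta_cost*dc = 0,
# so in (time, cost) space the contour's slope is:
slope = -beta_time / beta_cost  # -0.5 here

# Moving along that slope leaves the (linear) utility unchanged:
u = lambda t, c: beta_time * t + beta_cost * c
t0, c0, dt = 10.0, 5.0, 3.0
print(np.isclose(u(t0, c0), u(t0 + dt, c0 + slope * dt)))  # True
```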

In abstract theory, the field can take any mathematical form, not just a flat linear representation. In practice, though, if the utility field is not linear (or some very carefully specified non-linear form), finding good parameters for the model will be difficult, so we will assume a linear-in-parameters form.

We want to pick a set of parameters that define a field where the actually chosen alternatives are all uphill from the non-chosen alternatives.

Does this work?

[5]:
figure_two_observations_on_field(beta_time=-0.18, beta_cost=-0.1)
[5]:

How about this?

[6]:
figure_two_observations_on_field(beta_time=-0.01, beta_cost=-0.06)
[6]:

How about this?

[7]:
figure_two_observations_on_field(beta_time=-0.1, beta_cost=-0.2)
[7]:

With both arrows pointing uphill, we found a set of parameters that works to explain these observations.

Only differences in utility matter

This visualization is kind of busy. We can simplify it by reducing each observation to a single point, by mapping the differences in time and cost, instead of the raw values. (Hey, remember “only the differences matter”?)

Our current dataset has one row for each combination of obs and mode, but we might find it easier to collapse the data so that there’s only one row for each obs.

To make the change, first we’ll convert the index on the data. Originally this was a plain RangeIndex giving just the numerical position of each row. But we’ll make it a two-level MultiIndex of meaningful data: the observation and alternative IDs.

[8]:
data2.set_index(['obs','mode'])
[8]:
time cost chosen
obs mode
1 bus 20 10 True
car 10 20 False
2 car 12 16 True
bus 21 14 False

Then we can unstack to move the last level of that MultiIndex to the columns.

[9]:
data2_obs = data2.set_index(['obs','mode']).unstack()
data2_obs
[9]:
time cost chosen
mode bus car bus car bus car
obs
1 20 10 10 20 True False
2 21 12 14 16 False True

Now we can compute the differences in time and cost.

[10]:
diffs2 = pd.DataFrame({
    'timediff': data2_obs[('time','bus')] - data2_obs[('time','car')],
    'costdiff': data2_obs[('cost','bus')] - data2_obs[('cost','car')],
    'choice': np.where( data2_obs[('chosen','bus')], 'bus', 'car')
})
diffs2
[10]:
timediff costdiff choice
obs
1 10 -10 bus
2 9 -2 car

We can plot these points against a utility field, except instead of total utility for each alternative we now show “net” utility between Car and Bus. That also means there’s no longer a representation of “uphill” to focus on, as each observation is only a single point. Instead, we can look at positive and negative values in the field.

[11]:
figure_differences_on_field(diffs2, beta_time=-0.1, beta_cost=-0.2)
[11]:

This interactive figure (assuming you are in a live Jupyter notebook) makes it easier to explore a variety of different beta parameters.

[12]:
interactive_differences_on_field(diffs2, beta_time=-0.1, beta_cost=-0.2)

Side note: if you want to see the Python code used to create this interactive figure using ``plotly`` and ``ipywidgets``, check out the ``choice_theory_figures.py`` file.

Probability not utility

The color scale on this chart is “Net Utility of Car” but we’re not really interested in utility per se. What we’re more interested in is the choices – are we correctly predicting the usage of car and bus? So let’s apply the MNL transformation to change utility into probability, which can tell us if the model is making good predictions.
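Before playing with the figure, it may help to see the binary MNL transformation itself in isolation. A quick standalone sketch, using made-up utility values:

```python
import numpy as np

def mnl_prob_car(u_car, u_bus):
    """Binary MNL probability of choosing car from the two utilities."""
    return np.exp(u_car) / (np.exp(u_car) + np.exp(u_bus))

def logit_prob_car(u_diff):
    """Same probability, written using only the utility difference."""
    return 1 / (1 + np.exp(-u_diff))

# The two forms agree -- another echo of "only differences matter".
print(np.isclose(mnl_prob_car(1.2, 0.5), logit_prob_car(1.2 - 0.5)))  # True
print(logit_prob_car(0.0))  # 0.5 -- equal utilities give a 50/50 split
```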

[13]:
interactive_differences_with_probability(diffs2, beta_time=-0.1, beta_cost=-0.2)

Now it’s a bit clearer how we can manipulate our parameters to keep the choices correctly segregated.

What about if we have a few more data points?

[14]:
diffs9 = pd.read_csv(tt.data('diffs9.csv'), index_col='obs')
diffs9
[14]:
choice timediff costdiff
obs
1 car 9 -7
2 bus -8 -5
3 bus 4 -6
4 bus -7 -2
5 car 2 0
6 bus 1 -5
7 car -3 4
8 bus -2 -7
9 car 5 0

Can we find a solution that correctly predicts all the data points? This interactive figure marks observations that are unambiguously incorrect (less than 50% probability assigned to the actually chosen alternative) with an “X” and observations that are definitely correct (greater than 95% probability assigned to the actually chosen alternative) with a dot.

[15]:
interactive_differences_with_probability(diffs9, beta_time=-0.1, beta_cost=-0.2, correct=True)

With the observations above, and our time and cost levers, we can find a really nice solution where the model predicts every choice correctly with near certainty. But as we know, time and cost are not the only two relevant features of these modes. There are a lot of other reasons why someone might choose one or the other. And, generally, we observe a lot more than 9 trips in our data. Inevitably, there will be some mixing of the data, so we can’t neatly divide it with a clean line.

[16]:
diffs19 = pd.read_csv('example-data/diffs19.csv', index_col='obs')
diffs19
[16]:
choice timediff costdiff
obs
1 car 9 -7
2 bus -8 -5
3 bus 4 -6
4 bus -7 -2
5 bus 2 0
6 bus 1 -5
7 car -3 4
8 car -2 -7
9 car 5 0
10 car 7 -6
11 car 8 -7
12 car -5 -3
13 car 3 -8
14 bus -9 -1
15 car 1 0
16 bus 0 -9
17 car -5 5
18 car -5 -5
19 car 7 1
[17]:
interactive_differences_with_probability(diffs19, beta_time=-0.1, beta_cost=-0.2, correct=True)

If our goal is to get everything right, it’s hopeless. We need to change our goal.

We could just try to maximize the total number of correct predictions, where the actually chosen alternative is the one with the highest modeled probability. This is how some machine learning algorithms tackle this problem. But it tends to be problematic when the choices are very lopsided, with most observations choosing one alternative. This happens a lot in transportation planning and especially in mode choices. So we’ll need something more subtle than just “correct” and “incorrect”.
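To see why raw accuracy misleads on lopsided data, consider a tiny made-up example (the 90/10 split here is illustrative, not from our dataset):

```python
import numpy as np

# Made-up sample: 90 of 100 observed trips choose car.
choices = np.array(['car'] * 90 + ['bus'] * 10)

# A "model" that ignores all inputs and always predicts car scores
# 90% accuracy -- while telling us nothing about who chooses bus or why.
always_car_accuracy = (choices == 'car').mean()
print(always_car_accuracy)  # 0.9
```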

Likelihood

That something is called “likelihood”. We can assign a likelihood to each observation, which is the modeled probability that the choice is the actually observed choice. So, first we compute the probability of choosing each alternative.

[18]:
def compute_prob_car(timediff, costdiff):
    # timediff and costdiff are (bus minus car), so the exponent is
    # the utility of bus minus the utility of car, giving
    # P(car) = 1 / (1 + exp(V_bus - V_car))
    beta_time = -0.1
    beta_cost = -0.2
    return 1/(1+np.exp(
        timediff * beta_time
        + costdiff * beta_cost
    ))

p_car = compute_prob_car(diffs19.timediff, diffs19.costdiff)
p_car
[18]:
obs
1     0.377541
2     0.141851
3     0.310026
4     0.249740
5     0.549834
6     0.289050
7     0.622459
8     0.167982
9     0.622459
10    0.377541
11    0.354344
12    0.249740
13    0.214165
14    0.249740
15    0.524979
16    0.141851
17    0.622459
18    0.182426
19    0.710950
dtype: float64
[19]:
p_bus = 1-p_car
p_bus
[19]:
obs
1     0.622459
2     0.858149
3     0.689974
4     0.750260
5     0.450166
6     0.710950
7     0.377541
8     0.832018
9     0.377541
10    0.622459
11    0.645656
12    0.750260
13    0.785835
14    0.750260
15    0.475021
16    0.858149
17    0.377541
18    0.817574
19    0.289050
dtype: float64

Then we pick for each row the probability of the actually chosen alternative.

[20]:
current_likelihood = np.where(diffs19['choice'] == 'car', p_car, p_bus)
current_likelihood
[20]:
array([0.37754067, 0.85814894, 0.68997448, 0.75026011, 0.450166  ,
       0.7109495 , 0.62245933, 0.16798161, 0.62245933, 0.37754067,
       0.35434369, 0.24973989, 0.21416502, 0.75026011, 0.52497919,
       0.85814894, 0.62245933, 0.18242552, 0.7109495 ])

Then we want to find the joint likelihood of getting all these observations. Each observation is considered to be independent from the others, so the total likelihood is the product of all those individual likelihoods.

[21]:
current_likelihood.prod()
[21]:
6.820925494144253e-07

That’s a tiny number. Let’s take the log of it.

[22]:
np.log(current_likelihood.prod())
[22]:
-14.19810048535606

Or instead of taking the product then the log, we can take the log first, then the sum:

[23]:
np.log(current_likelihood).sum()
[23]:
-14.198100485356058

You might notice there’s a teeny tiny difference in the two results, which arises from the computer’s finite numerical precision. That tiny error isn’t a big deal for this tiny data set, but as the data set gets bigger the errors compound. Plus, as the dataset gets bigger, the total likelihood shrinks exponentially, such that it will very quickly become too small a number to store in a 64-bit floating point value. (The smallest positive number representable as a 64-bit float is only about 4.94e-324.) The total log likelihood, on the other hand, decreases only linearly with the number of observations. Thus, we’ll be able to easily manage the total log likelihood for any size data set you’ll ever encounter.
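A quick standalone demonstration of this underflow, using 2000 hypothetical observations each with likelihood 0.5:

```python
import numpy as np

likelihoods = np.full(2000, 0.5)

# The raw product underflows a 64-bit float to exactly zero...
print(likelihoods.prod())  # 0.0

# ...while the sum of logs is a perfectly ordinary number.
print(np.log(likelihoods).sum())  # about -1386.29
```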

Maximum likelihood

What we’ve done so far is to calculate a one-off log likelihood, for a particular set of beta parameters. But our reason for finding the likelihood is to be able to compare different possible sets of beta parameter values, and determine which parameters are better. Ultimately, we want to find the value of the beta parameters to get the best, maximum likelihood result. Or (equivalently) the maximum log likelihood, because log is a monotonic transformation. To do this, we need to turn the likelihood into a function instead of a one-time computation.

[24]:
def likely(betas):
    pr_car = 1/(1+np.exp(
        diffs19.timediff * betas[0]
        + diffs19.costdiff * betas[1]
    ))
    pr_bus = 1-pr_car
    likelihood = (
        (diffs19['choice'] == 'car') * pr_car
        + (diffs19['choice'] == 'bus') * pr_bus
    )
    return np.log(likelihood).sum()

You might notice our likelihood function takes the beta parameters as function arguments, holding the dataset fixed, while our earlier probability function took the dataset as an argument, and held the parameters fixed. Formally:

  • Probability is the chance of getting an outcome as a function of the data, given a model and its parameters: \(P_i(X | \beta)\)

  • Likelihood is the chance of getting an outcome as a function of the model parameters, given the data: \(L_i(\beta | X)\)

Having written up this function, we can now use it to get the log likelihood for any particular beta parameters, like this:

[25]:
likely([-0.1, -0.2])
[25]:
-14.198100485356058

Now, with this function, we can explore different values of the beta parameters, and try to find the maximum likelihood parameters.

[26]:
interactive_differences_with_probability(diffs19, beta_time=-0.1, beta_cost=-0.1, loglikelihood=True)

A few moments of playing with these levers should let you figure out that setting either parameter too far in either direction is bad for the log likelihood, and the highest values are in between. This is good, as it will let us home in on a finite “best” solution.

To help us visualize the likelihood for our simple example data with time and cost, we can compute the log likelihood for many different beta parameters values, and plot those results on a different figure.

[27]:
figure_log_likelihood(likely)
[27]:

The maximum likelihood is found at the peak of this hill. Finding that peak by brute force (i.e. plotting this whole figure) is a wildly inefficient way of finding the maximum likelihood, so we don’t generally do this. Instead, some algorithm is used to systematically converge to this point. The result is sometimes labeled as the “log likelihood at convergence”. We’ll see one such algorithm in a moment.

But, it is useful to see this visually in this simple two dimensional example to conceptualize the process. And it highlights another convenient mathematical feature of using the MNL model with a linear-in-parameters utility function: the log likelihood function can be proven to be globally concave, meaning that there is one and only one set of parameters that will maximize it, and they can be found by moving uphill from any point until the top is found.

There are a lot of different tools and algorithms that can help us find the maximum likelihood, or for that matter the maximum or minimum of any arbitrary Python function. One of the simplest tools is the fmin function in the scipy.optimize package.

[28]:
from scipy.optimize import fmin

There’s no fmax, so we need to take the negative of log likelihood to turn our maximization problem into a minimization one.

[29]:
negative_likely = lambda b: -likely(b)

One nifty feature of the fmin function is the ability to attach a “callback”. This is an extra function that gets called with the current position at the end of each search iteration. It allows us to monitor the current state of the algorithm, or store and later trace the trajectory followed by the algorithm to find the optimum.

[30]:
trajectory = []
[31]:
best = fmin(
    func=negative_likely,
    x0=[-0.1, -0.2],
    callback=trajectory.append,
)
best
Optimization terminated successfully.
         Current function value: 11.933198
         Iterations: 33
         Function evaluations: 63
[31]:
array([-0.14138137, -0.01490777])

Now we can look at what the fmin function did:

[32]:
figure_log_likelihood(likely, trajectory)
[32]:

Having found the maximum likelihood parameters, we can punch them in to our previous figure and see what the result looks like.

[33]:
interactive_differences_with_probability(diffs19, beta_time=best[0], beta_cost=best[1], loglikelihood=True)

Alternative specific constants

One interesting feature of the model so far is that the 50% iso-probability contour passes exactly through the origin at (0,0). This isn’t coincidental; it follows directly from the fact that the utility is a function only of cost and time, and nothing else. If the cost difference is zero and the time difference is zero, then the modeled utility difference of the two alternatives is exactly zero, regardless of the values of the beta parameters. We can fix this by adding an alternative specific constant to the utility function.

[34]:
def likely_2(betas):
    pr_car = 1/(1+np.exp(
        diffs19.timediff * betas[0]
        + diffs19.costdiff * betas[1]
        + betas[2]
    ))
    pr_bus = 1-pr_car
    likelihood = (
        (diffs19['choice'] == 'car') * pr_car
        + (diffs19['choice'] == 'bus') * pr_bus
    )
    return -np.log(likelihood).sum() # put the negative right here for convenience

We can add the new parameter to our interactive figure, and see if we can get a better log likelihood.

[35]:
interactive_differences_with_probability(
    diffs19, beta_time=best[0], beta_cost=best[1], beta_bus=0, loglikelihood=True
)

Looks like we can. We can use the optimization algorithm again to find the best values now for our 3 parameter model.

[36]:
trajectory_2 = []
best_2 = fmin(likely_2, [best[0], best[1], 0], callback=trajectory_2.append)
best_2
Optimization terminated successfully.
         Current function value: 10.500495
         Iterations: 88
         Function evaluations: 158
[36]:
array([-0.18669246, -0.16808146, -1.18576952])

We can relatively easily put this converged solution back into our interactive visualizer for this model.

[37]:
interactive_differences_with_probability(
    diffs19, beta_time=best_2[0], beta_cost=best_2[1], beta_bus=best_2[2], loglikelihood=True
)

But it gets harder to plot the three-dimensional log likelihood function. Here, we show three different two dimensional cross sections of the trajectory, and of the log likelihood function at the converged solution.

[38]:
figure_log_likelihood_3d(likely_2, trajectory_2)
[38]:
[39]:
import plotly.express as px
fig = px.line_3d(
    pd.DataFrame(trajectory_2, columns=['time', 'cost',  'const_bus']),
    x='time', y='cost', z='const_bus',
)
fig.show()

Standard error of the estimate

You might be familiar with seeing “standard errors” on parameter estimates. These standard errors are a result of the shape of this likelihood function at the converged solution (i.e., at the peak of the hill).

  • If the shape of the hill is very pointy, and likelihood drops very quickly as we move away from the maximum, that tells us that the standard error of the estimate is small – we are very confident that the true value of the estimator is very close to that peak.

  • If, on the other hand, the top of the hill is relatively flat, and likelihood drops very slowly as we move away from the maximum, that tells us that the standard error of the estimate is large – we are not confident that the true value of the estimator is very close to that peak.

  • The hill can be peaked or flat by different amounts in different dimensions.

In mathematical terms, the variance-covariance matrix of the estimates is the inverse of the second derivative (the Hessian) of the negative log likelihood with respect to the parameters, and the standard errors are the square roots of its diagonal. Tools built specifically for discrete choice model estimation will calculate this second derivative for you in an efficient manner. But seeing as we’re walking through this example to demonstrate how to do this in Python, let’s do it manually here.

In normal usage, as we’ve seen, we can easily write functions using Python, numpy, and pandas to compute the log likelihood. We could do a bunch of math to analytically derive the mathematical form of the derivatives (it’s not too hard for the MNL model), but with modern tools we don’t need to do all that work. Instead, to get the derivatives of the function, we’ll use one more library: autograd. This is a tool that allows for the automatic differentiation of functions. It’s not quite as easy to use as pandas, but with just a few tweaks we can get both the first and second derivatives of the log likelihood function.

[40]:
from autograd import grad, hessian
import autograd.numpy as anp
[41]:
# autograd doesn't play nicely with pandas, so
# we'll make all our data into numpy arrays.
choose_car = (diffs19['choice'] == 'car').to_numpy()
choose_bus = (diffs19['choice'] == 'bus').to_numpy()
timediff = diffs19.timediff.to_numpy()
costdiff = diffs19.costdiff.to_numpy()
[42]:
def likely_2a(betas):
    pr_car = 1/(1+anp.exp(timediff * betas[0] + costdiff * betas[1] + betas[2]))
    pr_bus = 1-pr_car
    likelihood = choose_car * pr_car + choose_bus * pr_bus
    return -anp.log(likelihood).sum()
[43]:
d_likely_2 = grad(likely_2a)
d2_likely_2 = hessian(likely_2a)

Now we can compute the gradient and hessian of the log likelihood function at any point.

[44]:
g = d_likely_2(best_2)
g
[44]:
array([-1.05492887e-04, -8.31110820e-04,  6.47935983e-05])
[45]:
h = d2_likely_2(best_2)
h
[45]:
array([[ 99.92733154, -15.58794645,  -2.90039936],
       [-15.58794645, 100.41639416, -12.88292984],
       [ -2.90039936, -12.88292984,   3.53169147]])

The standard errors are computed from the square roots of the diagonal elements of the inverse of the Hessian. (Since likely_2a returns the negative log likelihood, no additional sign flip is needed.)

[46]:
from scipy.linalg import inv
std_errs = np.sqrt(np.diag(inv(d2_likely_2(best_2))))
std_errs
[46]:
array([0.10863964, 0.14680194, 0.78263367])
[47]:
var_cov = inv(d2_likely_2(best_2))
var_cov
[47]:
array([[0.01180257, 0.00578134, 0.03078205],
       [0.00578134, 0.02155081, 0.08336112],
       [0.03078205, 0.08336112, 0.61251547]])

Goodness of Fit

The log likelihood achieved at the converged values, \(LL(\hat\beta)\), can be used to generate an overall measure of the “goodness of fit” of a discrete choice model. We can compare the log likelihood against better and worse models to create the measure.

The “best” possible model in this case is clearly well defined, as it would accurately predict every observation with certainty every time, yielding a likelihood of 1.0, and a log likelihood of zero.

The “worst” model is tricky, as we cannot simply take a model that never predicts the correct choice. Such a model would have a likelihood of zero, and a log likelihood of negative infinity. But, to create such a model, we would need to have some information about the process and use that information to intentionally “get it wrong” with certainty. Instead, we can contemplate a model that predicts every alternative as equally probable for every decision maker, using no information whatsoever to improve our predictions. This model would have non-zero probability for every chosen alternative, and has a well defined and finite log likelihood. We’ll call this model the “Null” model, and write the log likelihood of this model as \(LL(\emptyset)\).
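For our data the null log likelihood is easy to compute by hand: with two alternatives treated as equally probable, every chosen alternative has likelihood 1/2, so \(LL(\emptyset)\) is just \(N \log(1/2)\). (This agrees with likely_2 evaluated at all-zero betas, up to the sign flip built into likely_2.)

```python
import numpy as np

n_obs = 19   # observations in diffs19
j_alts = 2   # car and bus
ll_null = n_obs * np.log(1 / j_alts)
print(ll_null)  # about -13.1698
```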

Rho Squared

Because we use data to maximize the log likelihood, we are guaranteed to achieve a log likelihood somewhere in between these two extremes. The goodness of fit measure \(\rho^2_\emptyset\) (rho squared with respect to the null model) represents the fraction of the distance moved from the null model to the perfect model.

\[\rho^2_\emptyset = 1-\frac{LL(\hat\beta)}{LL(\emptyset)}\]
[48]:
rho_sq = 1-(likely_2(best_2)/likely_2(np.zeros_like(best_2)))
rho_sq
[48]:
0.20268359004349712

However, especially for mode choice models, there is very often a massive imbalance between the mode choices. You may find that the “drive alone” alternative is observed on well over half of all trips, and representing it as “equally likely” against a variety of other alternatives results in an epically poor model. We can counteract this by defining a different “bad” reference point: a model that consists only of alternative specific constants. This model will use no information other than the global market shares for each alternative, assigning the same relative probability for each alternative to all decision makers.

The goodness of fit measure \(\rho^2_c\) (rho squared with respect to the constants only model) represents the fraction of the distance moved from the constants only model to the perfect model.

\[\rho^2_c = 1-\frac{LL(\hat\beta)}{LL(c)}\]

To get the best ASC holding the other parameters at zero, we just need to create a function that does that, and feed it to the same optimization engine as before.

[49]:
likely_2_const = lambda asc: likely_2([0,0,asc])
params_asc = [0,0,fmin(likely_2_const, 0)]
Optimization terminated successfully.
         Current function value: 12.504090
         Iterations: 23
         Function evaluations: 46
[50]:
rho_sq_asc = 1-(likely_2(best_2)/likely_2(params_asc))
rho_sq_asc
[50]:
0.1602351703587762

Adjusted Rho Squared

A problem with the rho squared measures is that they always improve when adding variables to the model, regardless of whether the variable is relevant or not.

To counter this, we can adjust the value to penalize models that add parameters without improving the log likelihood enough to be worth the extra degree of freedom.

This adjusted rho-squared is given by:

\[\bar\rho^2_\star = 1-\frac{LL(\hat\beta) - K_{\hat\beta}}{LL(\star) - K_{\star}}\]

Where \(K_{\star}\) is the number of parameters in the model \(\star\). For the null model, this is zero, and for the constants only model, it is the number of constants in the model (i.e. the number of alternatives minus one). Similarly, in the numerator \(K\) represents the number of parameters in the relevant model.

For this adjusted rho squared, the addition of parameters does not necessarily improve the goodness of fit.

[51]:
adj_rho_sq = 1-((likely_2(best_2)+3)/(likely_2(np.zeros_like(best_2))))  # likely_2 returns -LL, so "-K" becomes "+K"
adj_rho_sq
[51]:
-0.0251104
[52]:
adj_rho_sq_asc = 1-((likely_2(best_2)+3)/(likely_2(params_asc)+1))
adj_rho_sq_asc
[52]:
0.0002662

You might notice we got a huge change from the regular rho squared to the adjusted version here. That happened because the dataset used here is so tiny, only 19 binary observations. For much larger datasets, the change from unadjusted to adjusted rho squared will be smaller.

Hypothesis Testing

Just as for linear regression models, discrete choice models offer a number of hypothesis tests to compare models against each other.

Single Parameter Tests

To test if a single parameter is significantly different from zero, we can use the \(t\) statistic, much as it is used in simple linear regression. For a parameter \(\beta_i\), which has an estimate \(\hat\beta_i\) with standard error of the estimate \(s_i\),

\[\frac{\hat\beta_i}{s_i} \sim t\]

The degrees of freedom are formally the number of observations in the sample minus the number of parameters in the model, but in practice this isn’t important, as the sample size used for any transportation planning model is generally large enough that the \(t\) distribution will converge to the normal distribution. Thus the critical value of \(t\) for 95% confidence is about 1.96, and for 99% confidence it is about 2.58.
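The critical values quoted above can be pulled from scipy directly; at large degrees of freedom the \(t\) distribution is effectively normal:

```python
from scipy.stats import norm, t

# Two-sided critical values from the normal distribution:
print(norm.ppf(0.975))         # about 1.96 (95% confidence)
print(norm.ppf(0.995))         # about 2.58 (99% confidence)

# With thousands of degrees of freedom, t is nearly identical:
print(t(df=10000).ppf(0.975))  # about 1.96
```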

[53]:
from scipy.stats import t

def t_test(parameter_value, std_err_est, deg_free):
    stat = parameter_value / std_err_est
    p = t(df=deg_free).sf(np.abs(stat)) * 2
    return f"{stat:.3f}, p={p:.3f}"
[54]:
t_test(best_2[0], std_errs[0], len(diffs19)-len(best_2))
[54]:
'-1.718, p=0.105'
[55]:
t_test(best_2[1], std_errs[1], len(diffs19)-len(best_2))
[55]:
'-1.145, p=0.269'
[56]:
t_test(best_2[2], std_errs[2], len(diffs19)-len(best_2))
[56]:
'-1.515, p=0.149'

Parameter Equality Tests

It is often interesting to determine if two parameters are statistically different from one another. This test is also based on the t-statistic, but it needs to account not only for the variance in the two estimators individually but also for the covariance (correlation) between them. If we define a null hypothesis

\[H_0 : \beta_i = \beta_j\]

then we can derive

\[\frac{\hat\beta_i - \hat\beta_j}{\sqrt{s_i^2 + s_j^2 - 2s_{ij}}} \sim t\]

A sufficiently large value of this \(t\) statistic will allow us to reject the null hypothesis with high confidence. In this case, that allows us to say that we are confident the true values of the two parameters are not the same.

[57]:
def t_test_2(parameter_value1, parameter_value2, std_err_est1, std_err_est2, std_err_est12, deg_free):
    stat = (
        (parameter_value1-parameter_value2) /
        np.sqrt(std_err_est1**2 + std_err_est2**2 - 2*std_err_est12)
    )
    p = t(df=deg_free).sf(np.abs(stat)) * 2
    return f"{stat:.3f}, p={p:.3f}"
[58]:
t_test_2(
    best_2[0], best_2[1],
    std_errs[0], std_errs[1],
    var_cov[0,1],
    deg_free=len(diffs19)-len(best_2)
)
[58]:
'-0.126, p=0.901'

Parameter Ratio Tests

We can use the same approach to check the ratio between two parameters. In this case, we define a null hypothesis

\[H_0 : \frac{\beta_i}{\beta_j} = K\]

with the value \(K\) being some fixed value. Or, equivalently

\[H_0 : \beta_i = K\beta_j\]

then we can derive

\[\frac{\hat\beta_i - K\hat\beta_j}{\sqrt{s_i^2 + K^2 s_j^2 - 2Ks_{ij}}} \sim t\]

A sufficiently large value of this \(t\) statistic will allow us to reject the null hypothesis with high confidence. In this case, that allows us to say that we are confident the ratio of the true values of the two parameters is not \(K\).

[59]:
def ratio_test_2(parameter_value1, parameter_value2, std_err_est1, std_err_est2, std_err_est12, k, deg_free):
    stat = (
        (parameter_value1-parameter_value2*k) /
        np.sqrt(std_err_est1**2 + k**2*std_err_est2**2 - 2*k*std_err_est12)
    )
    p = t(df=deg_free).sf(np.abs(stat)) * 2
    return f"{stat:.3f}, p={p:.3f}"
[60]:
ratio_test_2(
    best_2[0], best_2[1],
    std_errs[0], std_errs[1],
    var_cov[0,1],
    k=3,
    deg_free=len(diffs19)-len(best_2)
)
[60]:
'0.768, p=0.454'

Full Model Likelihood Ratio Test

Instead of looking at the parameters individually, we can evaluate the performance of the entire model. To do this we will use a likelihood ratio test. This test is valid when we can construct two distinct models that share the same structure, except one model has some restrictions placed on the parameter values and the other does not. These restrictions can take the form of fixing individual parameters to specific values (commonly, to zero) or fixing pairs of parameters to have particular ratios (commonly, one, which implies equality).

If the restricted model is the “true” model (our null hypothesis), then we would expect that the increase in likelihood found for the unrestricted model would be small. If that increase in likelihood turns out to be sufficiently large, then we can reject the null hypothesis, and state with confidence that the restrictions are not valid.

Using the likelihood ratio test involves evaluating the test statistic:

\[2\left( LL(U) - LL(R) \right) \sim \chi^2\]

where \(LL(U)\) is the log likelihood of the unrestricted model and \(LL(R)\) is the log likelihood of the restricted model. The \(\chi^2\) (chi squared) distribution is characterized by its degrees of freedom, which is equal to the number of restrictions imposed.

[61]:
from scipy.stats import chi2
import matplotlib.pyplot as plt

x = np.linspace(0,20)
for degfree in [1,2,5,10]:
    plt.plot(x,chi2.pdf(x, df=degfree), label=f'deg freedom={degfree}')
plt.title("Chi Squared Probability Density Function")
plt.legend();

To find the level of significance for rejecting a restricted model using a chi squared test (i.e. the probability that the likelihood ratio would be at least so big), use the sf method of the chi2 distribution.

[62]:
chi2(df=5).sf(7.5)
[62]:
0.186029833602867
[63]:
from scipy.stats import chi2

def likely_ratio_test(loglike_unconstrained, loglike_constrained, deg_free):
    stat = 2*(loglike_unconstrained - loglike_constrained)
    p = chi2(df=deg_free).sf(stat)
    return f"{stat:.3f}, p={p:.3f}"
[64]:
likely_ratio_test(likely_2(best_2), likely_2(params_asc), 2)
[64]:
'-4.007, p=1.000'

Oof! A negative statistic for chi-squared (ahem, squared) isn’t right. We’ve forgotten that our likely_2 is actually returning the negative of the log likelihood. This wasn’t a problem up to now, as the negatives cancelled out in things like rho-squared, but here they don’t cancel, so we should fix this ourselves.

[65]:
likely_ratio_test(-likely_2(best_2), -likely_2(params_asc), 2)
[65]:
'4.007, p=0.135'

That’s better.